feat: DeepSeek-V4-Pro perf recipes for GB300 / GB200 (1k/1k agg) #70
Merged: ishandhanani merged 4 commits into NVIDIA:main on Apr 24, 2026
Conversation
Adds eight SGLang recipes covering NVIDIA-verified DeepSeek-V4-Pro
(1.6T MXFP4 MoE) aggregated-serving configurations on Grace+Blackwell:
recipes/gb300-fp4/1k1k-dsv4/
agg-low-latency.yaml — TP=4 + MTP 3/4 (min TPOT)
agg-nomtp.yaml — TP=4 (baseline)
agg-balanced-tep.yaml — TP=4+DP=4 DeepEP + MTP 1/2
agg-max-tpt-tep.yaml — TP=4+DP=4 DeepEP (max TPS/GPU)
agg-2n-low-latency.yaml — TP=8 + MTP 3/4
agg-2n-nomtp.yaml — TP=8
recipes/gb200-fp4/1k1k-dsv4/
agg-2n-low-latency.yaml — TP=8 + MTP 3/4
agg-2n-nomtp.yaml — TP=8
Flag set derived from the SGLang DSv4 cookbook
(docs_new/cookbook/autoregressive/DeepSeek/DeepSeek-V4.mdx):
* moe-runner-backend: flashinfer_mxfp4 (MXFP4 MoE on Blackwell)
* chunked-prefill-size: 4096 + disable-flashinfer-autotune: true
* EAGLE spec-decoding 3/4 for low-latency, 1/2 for balanced
* TEP recipes: enable-dp-attention + moe-a2a-backend: deepep +
deepep-config num_sms=96 (DEEPEP_LARGE_SMS_FLAG, single-node Blackwell)
* disable-radix-cache: true (synthetic bench best practice, also
reduces allocator fragmentation during MXFP4 weight-reorder)
* mem-fraction-static: 0.78 (0.82 intermittently OOMs GB300 during
reorder_w1w3_to_w3w1; 0.78 leaves contiguous headroom)
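For illustration, the flags above might combine into a recipe along these lines. This is a hypothetical sketch: the recipe schema keys (`model`, `container`, `server_args`) and the MTP 3/4 → speculative-flag mapping are assumptions, not the actual files under recipes/gb300-fp4/1k1k-dsv4/.

```yaml
# Hypothetical sketch of agg-low-latency.yaml; schema keys are assumed,
# server flags mirror the list above.
model: deepseek-v4                 # alias from srtslurm.yaml.example
container: dsv4-grace-blackwell
nodes: 1
server_args:
  tp-size: 4
  speculative-algorithm: EAGLE     # assumed mapping for "MTP 3/4"
  speculative-num-steps: 3
  speculative-num-draft-tokens: 4
  moe-runner-backend: flashinfer_mxfp4
  chunked-prefill-size: 4096
  disable-flashinfer-autotune: true
  disable-radix-cache: true
  mem-fraction-static: 0.78
```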
srtslurm.yaml.example: added deepseek-v4 model + container aliases.
Also adds README.md in each recipe subdir with rationale + reference
pointers to the SGLang cookbook and DSv4-Pro model card.
Made-with: Cursor
Codecov Report ✅ All modified and coverable lines are covered by tests.

@@ Coverage Diff @@
##           main     #70   +/- ##
=================================
  Coverage      ?  70.35%
  Files         ?      59
  Lines         ?    6270
  Branches      ?       0
=================================
  Hits          ?    4411
  Misses        ?    1859
  Partials      ?       0

View full report in Codecov by Sentry.
…ARMUP Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This was referenced Apr 24, 2026
YAMY1234 added a commit to YAMY1234/srt-slurm-upstream that referenced this pull request on Apr 24, 2026:
Adapted from @ishandhanani's dynamo-frontend disagg recipe. 1 prefill + 1 decode node at TP=4, MXFP4 FlashInfer MoE runner, NIXL transfer backend. Benchmark type is ``manual`` for external sweep control.

Lives under ``recipes/gb300-fp4/1k1k-dsv4/`` to match PR NVIDIA#70's hardware-partitioned layout. Uses the ``dsv4-pro`` checkpoint and ``dsv4-grace-blackwell`` container (the GB300 aarch64 sglang image), since DSV4-Flash weights are not staged and the original ``dsflash`` / ``dspro`` / ``nginx`` aliases from the upstream recipe are local to @ishandhanani's environment.

Dynamo hash pinned: 9d3c913d300eb368cda28b3f98a23a5762621e0d

Made-with: Cursor
ishandhanani pushed a commit that referenced this pull request on Apr 24, 2026:
* feat(sa-bench): add sglang DeepSeek-V4 tokenizer
Adds a client-side tokenizer for DeepSeek-V4-Pro that matches sglang
server behavior. Usable via the existing 'module.path.ClassName' hook
in backend_request_func.get_tokenizer; no changes to sa-bench itself.
Motivation: DeepSeek-V4 ships no HF chat template, so
tokenizer.apply_chat_template() raises ValueError. Sglang's server
replaces the HF path with a hard-coded DSML encoder
(encoding_dsv4.encode_messages) whenever arch == 'DeepseekV4ForCausalLM',
per sgl-project/sglang PR #23600. Without a matching client-side
encoder, sa-bench input_tokens diverges from server #new-token.
Implementation:
- sa_bench_tokenizers/_sglang_encoding_dsv4.py: vendored byte-exact
from sgl-project/sglang@f5d03db (Apache-2.0, 840 lines).
- sa_bench_tokenizers/sglang_deepseek_v4.py: HF-compatible wrapper.
apply_chat_template() mirrors serving_chat exactly:
1) insert empty system message if missing
2) thinking_mode='chat', reasoning_effort=None (defaults)
3) call encode_messages(...)
4) hf_tokenizer.encode(..., add_special_tokens=False)
Usage (recipe):
benchmark:
custom_tokenizer: "sa_bench_tokenizers.sglang_deepseek_v4.SGLangDeepseekV4Tokenizer"
Recipes: recipe YAML authoring is out of scope; see #70
for DeepSeek-V4-Pro sglang recipes.
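The four-step mirroring described above can be sketched in Python. This is a hedged illustration, not the vendored sa_bench_tokenizers code: the `encode_messages` and `hf_tokenizer` collaborators are passed in as parameters here, whereas the real wrapper holds them internally.

```python
# Hypothetical sketch of the wrapper's apply_chat_template; names and
# call shapes are assumptions based on the commit message above.
from typing import Any, Callable


def apply_chat_template(messages: list[dict[str, Any]],
                        hf_tokenizer: Any,
                        encode_messages: Callable[..., str]) -> list[int]:
    # 1) insert an empty system message if the conversation lacks one
    if not messages or messages[0].get("role") != "system":
        messages = [{"role": "system", "content": ""}] + messages
    # 2)+3) render with the vendored DSML encoder using server defaults
    prompt = encode_messages(messages, thinking_mode="chat",
                             reasoning_effort=None)
    # 4) tokenize without re-adding special tokens (the encoder emits them)
    return hf_tokenizer.encode(prompt, add_special_tokens=False)
```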
* fix(sa-bench): load DeepSeek-V4 via PreTrainedTokenizerFast first
AutoTokenizer.from_pretrained rejects checkpoints whose model_type
(deepseek_v4) is not yet registered in mainline transformers. The V4
checkpoint ships a ready-made tokenizer.json, so prefer loading it
directly through PreTrainedTokenizerFast and only fall back to
AutoTokenizer (for future transformers releases that register V4).
Verified offline on DeepSeek-V4-Pro: 4 representative prompts
(hello, GSM8K, system+user, multi-turn) each produce client token IDs
byte-identical to the sglang server path
tokenizer.encode(encode_messages(msgs_with_empty_system,
thinking_mode=chat)).
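The loading order described in this commit can be sketched as a small helper. A minimal sketch, assuming a standard transformers checkpoint layout; the function name is hypothetical and the import is deferred so the snippet stands alone.

```python
# Sketch of "PreTrainedTokenizerFast first, AutoTokenizer fallback";
# load_dsv4_tokenizer is a hypothetical name, not the actual sa-bench API.
def load_dsv4_tokenizer(checkpoint: str):
    from transformers import AutoTokenizer, PreTrainedTokenizerFast
    try:
        # the checkpoint ships a ready-made tokenizer.json, so the fast
        # path works even while model_type "deepseek_v4" is unregistered
        return PreTrainedTokenizerFast.from_pretrained(checkpoint)
    except Exception:
        # future transformers releases that register V4 take this path
        return AutoTokenizer.from_pretrained(checkpoint)
```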
* feat(recipes): add gb300 1-node agg smoke recipe for SGLang DeepSeek-V4 tokenizer
Minimal recipe that exercises the new SGLangDeepseekV4Tokenizer wrapper end
to end on a single GB300 node (TP=4, MTP 3/4, MXFP4 MoE). Used to verify
client-side prompt encoding aligns with the SGLang server for DeepSeek-V4.
Key knobs:
- mem-fraction-static: 0.82 (required on 1-node: DSv4-Pro weights occupy
~206 GB/GPU, so a lower mfs leaves negative KV pool after SGLang reserves
(1-mfs)*total_mem for activations, triggering "Not enough memory")
- use_chat_template: true
- custom_tokenizer: sa_bench_tokenizers.sglang_deepseek_v4.SGLangDeepseekV4Tokenizer
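The mem-fraction-static reasoning above can be made concrete with back-of-the-envelope arithmetic. The 288 GB/GPU HBM figure for GB300 is an assumption for illustration; the ~206 GB/GPU weight footprint comes from the commit message.

```python
# KV-pool headroom under sglang's mem-fraction-static (mfs): sglang
# reserves (1 - mfs) * total for activations, so weights plus KV pool
# must fit inside mfs * total.
TOTAL_GB = 288.0     # assumed GB300 HBM per GPU
WEIGHTS_GB = 206.0   # DSv4-Pro weight shards per GPU at TP=4 (from text)

def kv_pool_gb(mem_fraction_static: float) -> float:
    return mem_fraction_static * TOTAL_GB - WEIGHTS_GB

print(kv_pool_gb(0.82))  # 30.16 GB of KV pool
print(kv_pool_gb(0.78))  # 18.64 GB: lowering mfs shrinks the pool, and on
                         # tighter configs it can go negative ("Not enough memory")
```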
* chore: add NVIDIA SPDX copyright headers to sa_bench_tokenizers files
Required by NVIDIA/srt-slurm CI license check.
* feat(sa-bench): mirror SGLANG_ENABLE_THINKING / SGLANG_REASONING_EFFORT env fallback
The client-side DSV4 tokenizer renders prompts via the vendored
``encoding_dsv4.encode_messages`` and previously hard-coded
``thinking_mode="chat"`` and ``reasoning_effort=None``. This matches the
sglang server only when the user has not set the DSV4 thinking knobs.
sglang ``serving_chat.py`` (PR #23600) honors two envs as fallbacks:
- ``SGLANG_ENABLE_THINKING=1`` -> ``thinking_mode="thinking"``
- ``SGLANG_REASONING_EFFORT=max|high`` -> passed through to the encoder
Recipes typically set these in ``prefill_environment`` /
``decode_environment`` for reasoning eval workloads (gpqa, aime25, etc.)
on DeepSeek-V4-Flash. Without matching fallback on the sa-bench client,
server prompts would be wrapped in ``<think>...</think>`` while the
client still rendered the chat template, desynchronizing ISL / TPOT /
MTP accept-rate accounting.
This change:
- Adds ``_env_enable_thinking()`` / ``_env_reasoning_effort()`` helpers
that parse the envs the same way sglang ``EnvBool`` / ``EnvStr`` do
(``{1,true,yes,on}`` truthy; ``max|high`` filter).
- Changes ``apply_chat_template`` defaults from Python literals to
``None`` sentinels; when ``None``, falls back to env. Explicit caller
kwargs (incl. ``thinking=False``) still win, matching the server's
``(request.chat_template_kwargs or {}).get("thinking", env_default)``
precedence.
- Expands the docstring to show the real server call chain (not just
the happy-path defaults).
Smoke-tested all five precedence cases (no env + no kwarg / env on +
no kwarg / env on + explicit False / env off + explicit True / bogus
env value filtered).
Made-with: Cursor
This was referenced Apr 26, 2026
Based on #69